Reinforcement Learning in Continuous Time: Advantage Updating
Abstract
A new algorithm for reinforcement learning, advantage updating, is described. Advantage updating is a direct learning technique; it does not require a model to be given or learned. It is incremental, requiring only a constant amount of calculation per time step, independent of the number of possible actions, possible outcomes from a given action, or number of states. Analysis and simulation indicate that advantage updating is applicable to reinforcement learning systems working in continuous time (or discrete time with small time steps) for which standard algorithms such as Q-learning are not applicable. Simulation results are presented indicating that for a simple linear quadratic regulator (LQR) problem, advantage updating learns more quickly than Q-learning by a factor of 100,000 when the time step is small. Even for large time steps, advantage updating is never slower than Q-learning, and advantage updating is more resistant to noise than is Q-learning. Convergence properties are discussed. It is proved that the learning rule for advantage updating converges to the optimal policy with probability one.

REINFORCEMENT LEARNING

A reinforcement learning problem is an optimal control problem where the controller is given a scalar reinforcement signal (or cost function) indicating how well it is performing. The reinforcement signal is a function of the state of the system being controlled and the control signals chosen by the controller. The goal is to maximize the expected total discounted reinforcement, which for continuous time is defined as

$$E\left[\int_0^\infty \gamma^t \, r(x_t, u_t) \, dt\right]$$
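As a rough illustration of the objective stated above, the discounted reinforcement along a single trajectory (the integral over time of γ^t r(x_t, u_t)) can be approximated by a Riemann sum with a small time step. The sketch below is not taken from the paper: the function name `discounted_return`, the constant reinforcement signal, and the values chosen for `gamma` and `dt` are all assumptions made for illustration. The expectation in the definition would be estimated by averaging this quantity over many sampled trajectories.

```python
import math


def discounted_return(rewards, dt, gamma):
    """Left Riemann sum approximating the integral of gamma**t * r(t) dt.

    rewards[k] is the reinforcement observed at time t = k * dt
    (hypothetical sampling scheme, assumed for this sketch).
    """
    return sum((gamma ** (k * dt)) * r * dt for k, r in enumerate(rewards))


# Illustrative check with a constant reinforcement r = 1, where the
# exact value of the integral over [0, infinity) is -1 / ln(gamma).
gamma = 0.9        # assumed discount factor per unit time
dt = 0.001         # assumed small time step
steps = 200_000    # horizon of 200 time units; the tail beyond it is negligible

approx = discounted_return([1.0] * steps, dt, gamma)
exact = -1.0 / math.log(gamma)
print(f"approx={approx:.4f} exact={exact:.4f}")
```

Shrinking `dt` tightens the approximation, which corresponds to the small-time-step regime the abstract is concerned with.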
Similar Resources
Reinforcement Learning Applied to a Differential Game
An application of reinforcement learning to a linear-quadratic, differential game is presented. The reinforcement learning system uses a recently developed algorithm, the residual-gradient form of advantage updating. The game is a Markov decision process with continuous time, states, and actions, linear dynamics, and a quadratic cost function. The game consists of two players, a missile and a p...
Advantage Updating Applied to a Differential Game
An application of reinforcement learning to a linear-quadratic, differential game is presented. The reinforcement learning system uses a recently developed algorithm, the residual gradient form of advantage updating. The game is a Markov Decision Process (MDP) with continuous time, states, and actions, linear dynamics, and a quadratic cost function. The game consists of two players, a missile a...
Multi-Player Residual Advantage Learning With General Function Approximation
A new algorithm, advantage learning, is presented that improves on advantage updating by requiring that a single function be learned rather than two. Furthermore, advantage learning requires only a single type of update, the learning update, while advantage updating requires two different types of updates, a learning update and a normalization update. The reinforcement learning system uses the ...
Integral Policy Iterations for Reinforcement Learning Problems in Continuous Time and Space
Policy iteration (PI) is a recursive process of policy evaluation and improvement to solve an optimal decision-making, e.g., reinforcement learning (RL) or optimal control problem and has served as the fundamental to develop RL methods. Motivated by integral PI (IPI) schemes in optimal control and RL methods in continuous time and space (CTS), this paper proposes on-policy IPI to solve the gene...
Advantages of Cooperation Between Reinforcement Learning Agents in Difficult Stochastic Problems
This paper presents the first results in understanding the reasons for cooperative advantage between reinforcement learning agents. We consider a cooperation method which consists of using and updating a common policy. We tested this method on a complex fuzzy reinforcement learning problem and found that cooperation brings larger than expected benefits. More precisely, we found that K cooperativ...